Db2 Jupyter Notebook Extensions Tutorial

The SQL code tutorials for Db2 rely on a Jupyter notebook extension, commonly refer to as a "magic" command. The beginning of all of the notebooks begin with the following command which will load the extension and allow the remainder of the notebook to use the %sql magic command.

%run db2.ipynb
The cell below will load the Db2 extension. Note that it will take a few seconds for the extension to load, so you should generally wait until the "Db2 Extensions Loaded" message is displayed in your notebook.


In [ ]:
%run db2.ipynb

Options

There are two options that can be set with the %sql command. These options are:

  • MAXROWS n - The maximum number of rows that you want to display as part of a SQL statement. Setting MAXROWS to -1 will return all output, while maxrows of 0 will suppress all output.
  • RUNTIME n - When using the timer option on a SQL statement, the statement will execute for n number of seconds. The result that is returned is the number of times the SQL statement executed rather than the execution time of the statement. The default value for runtime is one second, so if the SQL is very complex you will need to increase the run time.

To set an option use the following syntax:

%sql option option_name value option_name value ....

The following example sets all three options:

%sql option maxrows 100 runtime 2

The values will be saved between Jupyter notebooks sessions.

Connections to Db2

Before any SQL commands can be issued, a connection needs to be made to the Db2 database that you will be using. The connection can be done manually (through the use of the CONNECT command), or automatically when the first %sql command is issued.

The Db2 magic command tracks whether or not a connection has occured in the past and saves this information between notebooks and sessions. When you start up a notebook and issue a command, the program will reconnect to the database using your credentials from the last session. In the event that you have not connected before, the system will prompt you for all the information it needs to connect. This information includes:

  • Database name (SAMPLE)
  • Hostname - localhost (enter an IP address if you need to connect to a remote server)
  • PORT - 50000 (this is the default but it could be different)
  • Userid - DB2INST1
  • Password - No password is provided so you have to enter a value
  • Maximum Rows - 10 lines of output are displayed when a result set is returned

There will be default values presented in the panels that you can accept, or enter your own values. All of the information will be stored in the directory that the notebooks are stored on. Once you have entered the information, the system will attempt to connect to the database for you and then you can run all of the SQL scripts. More details on the CONNECT syntax will be found in a section below.

If you have credentials available from Db2 on Cloud or DSX, place the contents of the credentials into a variable and then use the CONNECT CREDENTIALS <var> syntax to connect to the database.

db2blu = { "uid" : "xyz123456", ...}
%sql CONNECT CREDENTIALS db2blu

If the connection is successful using the credentials, the variable will be saved to disk so that you can connected from within another notebook using the same syntax.

The next statement will force a CONNECT to occur with the default values. If you have not connected before, it will prompt you for the information.


In [ ]:
%sql CONNECT

Line versus Cell Command

The Db2 extension is made up of one magic command that works either at the LINE level (%sql) or at the CELL level (%%sql). If you only want to execute a SQL command on one line in your script, use the %sql form of the command. If you want to run a larger block of SQL, then use the %%sql form. Note that when you use the %%sql form of the command, the entire contents of the cell is considered part of the command, so you cannot mix other commands in the cell.

The following is an example of a line command:


In [ ]:
%sql VALUES 'HELLO THERE'

If you have SQL that requires multiple lines, of if you need to execute many lines of SQL, then you should be using the CELL version of the %sql command. To start a block of SQL, start the cell with %%sql and do not place any SQL following the command. Subsequent lines can contain SQL code, with each SQL statement delimited with the semicolon (;). You can change the delimiter if required for procedures, etc... More details on this later.


In [ ]:
%%sql
VALUES
  1,
  2,
  3

If you are using a single statement then there is no need to use a delimiter. However, if you are combining a number of commands then you must use the semicolon.


In [ ]:
%%sql
DROP TABLE STUFF;
CREATE TABLE STUFF (A INT);
INSERT INTO STUFF VALUES
  1,2,3;
SELECT * FROM STUFF;

The script will generate messages and output as it executes. Each SQL statement that generates results will have a table displayed with the result set. If a command is executed, the results of the execution get listed as well. The script you just ran probably generated an error on the DROP table command.

Options

Both forms of the %sql command have options that can be used to change the behavior of the code. For both forms of the command (%sql, %%sql), the options must be on the same line as the command:

%sql -t ...
%%sql -t

The only difference is that the %sql command can have SQL following the parameters, while the %%sql requires the SQL to be placed on subsequent lines.

There are a number of parameters that you can specify as part of the %sql statement.

  • -d - Use alternative delimiter
  • -t - Time the statement execution
  • -q - Suppress messages
  • -j - JSON formatting of a column
  • -a - Show all output
  • -pb - Bar chart of results
  • -pp - Pie chart of results
  • -pl - Line chart of results
  • -i - Interactive mode with Pixiedust
  • -sampledata Load the database with the sample EMPLOYEE and DEPARTMENT tables
  • -r - Return the results into a variable (list of rows)
  • -e - Echo macro substitution

Multiple parameters are allowed on a command line. Each option should be separated by a space:

%sql -a -j ...

A SELECT statement will return the results as a dataframe and display the results as a table in the notebook. If you use the assignment statement, the dataframe will be placed into the variable and the results will not be displayed:

r = %sql SELECT * FROM EMPLOYEE

The sections below will explain the options in more detail.

Delimiters

The default delimiter for all SQL statements is the semicolon. However, this becomes a problem when you try to create a trigger, function, or procedure that uses SQLPL (or PL/SQL). Use the -d option to turn the SQL delimiter into the at (@) sign and -q to suppress error messages. The semi-colon is then ignored as a delimiter.

For example, the following SQL will use the @ sign as the delimiter.


In [ ]:
%%sql -d -q
DROP TABLE STUFF
@
CREATE TABLE STUFF (A INT)
@
INSERT INTO STUFF VALUES
  1,2,3
@
SELECT * FROM STUFF
@

The delimiter change will only take place for the statements following the %%sql command. Subsequent cells in the notebook will still use the semicolon. You must use the -d option for every cell that needs to use the semicolon in the script.

Limiting Result Sets

The default number of rows displayed for any result set is 10. You have the option of changing this option when initially connecting to the database. If you want to override the number of rows display you can either update the control variable, or use the -a option. The -a option will display all of the rows in the answer set. For instance, the following SQL will only show 10 rows even though we inserted 15 values:


In [ ]:
%sql values 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

You will notice that the displayed result will split the visible rows to the first 5 rows and the last 5 rows. Using the -a option will display all values in a scrollable table.


In [ ]:
%sql -a values 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

To change the default value of rows displayed, you can use the %sql option maxrow command to set the value to something else. A value of 0 or -1 means unlimited output.


In [ ]:
%sql option maxrows 5
%sql values 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

A special note regarding the output from a SELECT statement. If the SQL statement is the last line of a block, the results will be displayed by default (unless you assigned the results to a variable). If the SQL is in the middle of a block of statements, the results will not be displayed. To explicitly display the results you must use the display function (or pDisplay if you have imported another library like pixiedust which overrides the pandas display function).


In [ ]:
# Set the maximum back
%sql option maxrows 10
%sql values 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

Quiet Mode

Every SQL statement will result in some output. You will either get an answer set (SELECT), or an indication if the command worked. For instance, the following set of SQL will generate some error messages since the tables will probably not exist:


In [ ]:
%%sql
DROP TABLE TABLE_NOT_FOUND;
DROP TABLE TABLE_SPELLED_WRONG;

If you know that these errors may occur you can silence them with the -q option.


In [ ]:
%%sql -q
DROP TABLE TABLE_NOT_FOUND;
DROP TABLE TABLE_SPELLED_WRONG;

SQL output will not be suppressed, so the following command will still show the results.


In [ ]:
%%sql -q
DROP TABLE TABLE_NOT_FOUND;
DROP TABLE TABLE_SPELLED_WRONG;
VALUES 1,2,3;

Variables in %sql Blocks

Python variables can be passed to a %sql line command, and to a %%sql block. For both forms of the %sql command you can pass variables by placing a colon in front of the variable name.

%sql SELECT * FROM EMPLOYEE WHERE EMPNO = :empno

The following example illustrates the use of a variable in the SQL.


In [ ]:
empno = '000010'
%sql SELECT * FROM EMPLOYEE WHERE EMPNO = :empno

You can doublecheck that the substitution took place by using the -e option which echos the SQL command after substitution.


In [ ]:
%sql -e SELECT * FROM EMPLOYEE WHERE EMPNO = :empno

Note that the variable :empno did not have quotes around it, although it is a string value. The %sql call will examine the contents of the variable and add quotes around strings so you do not have to supply them in the SQL command.

Variables can also be array types. Arrays are expanded into multiple values, each separated by commas. This is useful when building SQL IN lists. The following example searches for 3 employees based on their employee number.


In [ ]:
empnos = ['000010','000020','000030']
%sql SELECT * FROM EMPLOYEE WHERE EMPNO IN (:empnos)

You can reference individual array items using this technique as well. If you wanted to search for only the first value in the empnos array, use :empnos[0] instead.


In [ ]:
%sql SELECT * FROM EMPLOYEE WHERE EMPNO IN (:empnos[0])

One final type of variable substitution that is allowed is for dictionaries. Python dictionaries resemble JSON objects and can be used to insert JSON values into Db2. For instance, the following variable contains company information in a JSON structure.


In [ ]:
customer = {
    "name" : "Aced Hardware Stores",
    "city" : "Rockwood",
    "employees" : 14
}

Db2 has builtin functions for dealing with JSON objects. There is another Jupyter notebook which goes through this in detail. Rather than using those functions, the following code will create a Db2 table with a string column that will contain the contents of this JSON record.


In [ ]:
%%sql
DROP TABLE SHOWJSON;
CREATE TABLE SHOWJSON (INJSON VARCHAR(256));

To insert the Dictionary (JSON Record) into this Db2 table, you only need to use the variable name as one of the fields being inserted.


In [ ]:
%sql INSERT INTO SHOWJSON VALUES :customer

Selecting from this table will show that the data has been inserted as a string.


In [ ]:
%sql select * from showjson

If you want to retrieve the data from a column that contains JSON records, you must use the -j flag to insert the contents back into a variable.


In [ ]:
v = %sql -j SELECT * FROM SHOWJSON

The variable v now contains the original JSON record for you to use.


In [ ]:
v

SQL Character Strings

Character strings require special handling when dealing with Db2. The single quote character ' is reserved for delimiting string constants, while the double quote " is used for naming columns that require special characters. You cannot use the double quote character to delimit strings that happen to contain the single quote character. What Db2 requires you do is placed two quotes in a row to have them interpreted as a single quote character. For instance, the next statement will select one employee from the table who has a quote in their last name: O'CONNELL.


In [ ]:
%sql SELECT * FROM EMPLOYEE WHERE LASTNAME = 'O''CONNELL'

Python handles quotes differently! You can assign a string to a Python variable using single or double quotes. The following assignment statements are not identical!


In [ ]:
lastname = "O'CONNELL"
print(lastname)
lastname = 'O''CONNELL'
print(lastname)

If you use the same syntax as Db2, Python will remove the quote in the string! It interprets this as two strings (O and CONNELL) being concatentated together. That probably isn't what you want! So the safest approach is to use double quotes around your string when you assign it to a variable. Then you can use the variable in the SQL statement as shown in the following example.


In [ ]:
lastname = "O'CONNELL"
%sql -e SELECT * FROM EMPLOYEE WHERE LASTNAME = :lastname

Notice how the string constant was updated to contain two quotes when inserted into the SQL statement. This is done automatically by the %sql magic command, so there is no need to use the two single quotes when assigning a string to a variable. However, you must use the two single quotes when using constants in a SQL statement.

Builtin Variables

There are 5 predefined variables defined in the program:

  • database - The name of the database you are connected to
  • uid - The userid that you connected with
  • hostname = The IP address of the host system
  • port - The port number of the host system
  • max - The maximum number of rows to return in an answer set

Theses variables are all part of a structure called _settings. To retrieve a value, use the syntax:

db = _settings['database']

There are also 3 variables that contain information from the last SQL statement that was executed.

  • sqlcode - SQLCODE from the last statement executed
  • sqlstate - SQLSTATE from the last statement executed
  • sqlerror - Full error message returned on last statement executed

You can access these variables directly in your code. The following code segment illustrates the use of the SQLCODE variable.


In [ ]:
empnos = ['000010','999999']
for empno in empnos:
    ans1 = %sql -r SELECT SALARY FROM EMPLOYEE WHERE EMPNO = :empno
    if (sqlcode != 0):
        print("Employee "+ empno + " left the company!")
    else:
        print("Employee "+ empno + " salary is " + str(ans1[1][0]))

Timing SQL Statements

Sometimes you want to see how the execution of a statement changes with the addition of indexes or other optimization changes. The -t option will run the statement on the LINE or one SQL statement in the CELL for exactly one second. The results will be displayed and optionally placed into a variable. The syntax of the command is:

sql_time = %sql -t SELECT * FROM EMPLOYEE
For instance, the following SQL will time the VALUES clause.


In [ ]:
%sql -t VALUES 1,2,3,4,5,6,7,8,9

When timing a statement, no output will be displayed. If your SQL statement takes longer than one second you will need to modify the runtime options. You can use the %sql option runtime command to change the duration the statement runs.


In [ ]:
%sql option runtime 5
%sql -t VALUES 1,2,3,4,5,6,7,8,9
%sql option runtime 1

JSON Formatting

Db2 supports querying JSON that is stored in a column within a table. Standard output would just display the JSON as a string. For instance, the following statement would just return a large string of output.


In [ ]:
%%sql 
VALUES 
      '{
      "empno":"000010",
      "firstnme":"CHRISTINE",
      "midinit":"I",
      "lastname":"HAAS",
      "workdept":"A00",
      "phoneno":[3978],
      "hiredate":"01/01/1995",
      "job":"PRES",
      "edlevel":18,
      "sex":"F",
      "birthdate":"08/24/1963",
      "pay" : {
        "salary":152750.00,
        "bonus":1000.00,
        "comm":4220.00}
      }'

Adding the -j option to the %sql (or %%sql) command will format the first column of a return set to better display the structure of the document. Note that if your answer set has additional columns associated with it, they will not be displayed in this format.


In [ ]:
%%sql -j 
VALUES 
      '{
      "empno":"000010",
      "firstnme":"CHRISTINE",
      "midinit":"I",
      "lastname":"HAAS",
      "workdept":"A00",
      "phoneno":[3978],
      "hiredate":"01/01/1995",
      "job":"PRES",
      "edlevel":18,
      "sex":"F",
      "birthdate":"08/24/1963",
      "pay" : {
        "salary":152750.00,
        "bonus":1000.00,
        "comm":4220.00}
      }'

JSON fields can be inserted into Db2 columns using Python dictionaries. This makes the input and output of JSON fields much simpler. For instance, the following code will create a Python dictionary which is similar to a JSON record.


In [ ]:
employee = {
    "firstname" : "John",
    "lastname" : "Williams",
    "age" : 45
}

The field can be inserted into a character column (or BSON if you use the JSON functions) by doing a direct variable insert.


In [ ]:
%%sql -q
DROP TABLE SHOWJSON;
CREATE TABLE SHOWJSON(JSONIN VARCHAR(128));

An insert would use a variable parameter (colon in front of the variable) instead of a character string.


In [ ]:
%sql INSERT INTO SHOWJSON VALUES (:employee)
%sql SELECT * FROM SHOWJSON

An assignment statement to a variable will result in an equivalent Python dictionary type being created. Note that we must use the raw -j flag to make sure we only get the data and not a data frame.


In [ ]:
x = %sql -j SELECT * FROM SHOWJSON
print("First Name is " + x[0]["firstname"] + " and the last name is " + x[0]['lastname'])

Plotting

Sometimes it would be useful to display a result set as either a bar, pie, or line chart. The first one or two columns of a result set need to contain the values need to plot the information.

The three possible plot options are:

  • -pb - bar chart (x,y)
  • -pp - pie chart (y)
  • -pl - line chart (x,y)

The following data will be used to demonstrate the different charting options.


In [ ]:
%sql values 1,2,3,4,5

Since the results only have one column, the pie, line, and bar charts will not have any labels associated with them. The first example is a bar chart.


In [ ]:
%sql -pb values 1,2,3,4,5

The same data as a pie chart.


In [ ]:
%sql -pp values 1,2,3,4,5

And finally a line chart.


In [ ]:
%sql -pl values 1,2,3,4,5

If you retrieve two columns of information, the first column is used for the labels (X axis or pie slices) and the second column contains the data.


In [ ]:
%sql -pb values ('A',1),('B',2),('C',3),('D',4),('E',5)

For a pie chart, the first column is used to label the slices, while the data comes from the second column.


In [ ]:
%sql -pp values ('A',1),('B',2),('C',3),('D',4),('E',5)

Finally, for a line chart, the x contains the labels and the y values are used.


In [ ]:
%sql -pl values ('A',1),('B',2),('C',3),('D',4),('E',5)

The following SQL will plot the number of employees per department.


In [ ]:
%%sql -pb
SELECT WORKDEPT, COUNT(*) 
  FROM EMPLOYEE
GROUP BY WORKDEPT

The final option for plotting data is to use interactive mode -i. This will display the data using an open-source project called Pixiedust. You can view the results in a table and then interactively create a plot by dragging and dropping column names into the appropriate slot. The next command will place you into interactive mode.


In [ ]:
%sql -i select * from employee

Sample Data

Many of the Db2 notebooks depend on two of the tables that are found in the SAMPLE database. Rather than having to create the entire SAMPLE database, this option will create and populate the EMPLOYEE and DEPARTMENT tables in your database. Note that if you already have these tables defined, they will not be dropped.


In [ ]:
%sql -sampledata

Result Sets

By default, any %sql block will return the contents of a result set as a table that is displayed in the notebook. The results are displayed using a feature of pandas dataframes. The following select statement demonstrates a simple result set.


In [ ]:
%sql select * from employee fetch first 3 rows only

You can assign the result set directly to a variable.


In [ ]:
x = %sql select * from employee fetch first 3 rows only

The variable x contains the dataframe that was produced by the %sql statement so you access the result set by using this variable or display the contents by just referring to it in a command line.


In [ ]:
x

There is an additional way of capturing the data through the use of the -r flag.

var = %sql -r select * from employee
Rather than returning a dataframe result set, this option will produce a list of rows. Each row is a list itself. The column names are found in row zero (0) and the data rows start at 1. To access the first column of the first row, you would use var[1][0] to access it.


In [ ]:
rows = %sql -r select * from employee fetch first 3 rows only
print(rows[1][0])

The number of rows in the result set can be determined by using the length function and subtracting one for the header row.


In [ ]:
print(len(rows)-1)

If you want to iterate over all of the rows and columns, you could use the following Python syntax instead of creating a for loop that goes from 0 to 41.


In [ ]:
for row in rows:
    line = ""
    for col in row:
        line = line + str(col) + ","
    print(line)

If you don't want the header row, modify the first line to start at the first row instead of row zero.


In [ ]:
for row in rows[1:]:
    line = ""
    for col in row:
        line = line + str(col) + ","
    print(line)

Since the data may be returned in different formats (like integers), you should use the str() function to convert the values to strings. Otherwise, the concatenation function used in the above example will fail. For instance, the 6th field is a birthdate field. If you retrieve it as an individual value and try and concatenate a string to it, you get the following error.


In [ ]:
try:
    print("Birth Date="+rows[1][6])
except Exception as err:
    print("Oops... Something went wrong!")
    print(err)

You can fix this problem by adding the str function to convert the date.


In [ ]:
print("Birth Date="+str(rows[1][6]))

Development SQL

The previous set of %sql and %%sql commands deals with SQL statements and commands that are run in an interactive manner. There is a class of SQL commands that are more suited to a development environment where code is iterated or requires changing input. The commands that are associated with this form of SQL are:

  • AUTOCOMMIT
  • COMMIT/ROLLBACK
  • PREPARE
  • EXECUTE

In addition, the sqlcode, sqlstate and sqlerror fields are populated after every statement so you can use these variables to test for errors.

Autocommit is the default manner in which SQL statements are executed. At the end of the successful completion of a statement, the results are commited to the database. There is no concept of a transaction where multiple DML/DDL statements are considered one transaction. The AUTOCOMMIT command allows you to turn autocommit OFF or ON. This means that the set of SQL commands run after the AUTOCOMMIT OFF command are executed are not commited to the database until a COMMIT or ROLLBACK command is issued.

COMMIT (WORK) will finalize all of the transactions (COMMIT) to the database and ROLLBACK will undo all of the changes. If you issue a SELECT statement during the execution of your block, the results will reflect all of your changes. If you ROLLBACK the transaction, the changes will be lost.

PREPARE is typically used in a situation where you want to repeatidly execute a SQL statement with different variables without incurring the SQL compilation overhead. For instance:

x = %sql PREPARE SELECT LASTNAME FROM EMPLOYEE WHERE EMPNO=?
for y in ['000010','000020','000030']:
    %sql execute :x using :y

EXECUTE is used to execute a previously compiled statement.

Db2 CONNECT Statement

As mentioned at the beginning of this notebook, connecting to Db2 is automatically done when you issue your first %sql statement. Usually the program will prompt you with what options you want when connecting to a database. The other option is to use the CONNECT statement directly. The CONNECT statement is similar to the native Db2 CONNECT command, but includes some options that allow you to connect to databases that has not been catalogued locally.

The CONNECT command has the following format:

%sql CONNECT TO <database> USER <userid> USING <password | ?> HOST <ip address> PORT <port number>
If you use a "?" for the password field, the system will prompt you for a password. This avoids typing the password as clear text on the screen. If a connection is not successful, the system will print the error message associated with the connect request.

If the connection is successful, the parameters are saved on your system and will be used the next time you run a SQL statement, or when you issue the %sql CONNECT command with no parameters.

If you want to force the program to connect to a different database (with prompting), use the CONNECT RESET command. The next time you run a SQL statement, the program will prompt you for the the connection and will force the program to reconnect the next time a SQL statement is executed.

Credits: IBM 2018, George Baklarz [baklarz@ca.ibm.com]